open source dataset
Abstractive Text Summarization for Resumes With Cutting Edge NLP Transformers and LSTM
Mercan, Öykü Berfin, Cavsak, Sena Nur, Deliahmetoglu, Aysu, Tanberk, Senem
Text summarization is a fundamental task in natural language processing that aims to condense large amounts of textual information into concise and coherent summaries. With the exponential growth of content and the need to extract key information efficiently, text summarization has gained significant attention in recent years. In this study, the performance of LSTM and pre-trained T5, Pegasus, BART and BART-Large models was evaluated on open source datasets (Xsum, CNN/Daily Mail, Amazon Fine Food Review and News Summary) and on a prepared resume dataset. The resume dataset comprises 75 resumes and covers information such as language, education, experience, personal information, and skills. The primary objective of this research was to classify resume text. Various techniques, including LSTM, pre-trained models, and fine-tuned models, were assessed using the resume dataset. The BART-Large model fine-tuned on the resume dataset gave the best performance.
22 open source datasets to boost AI modeling
Some say, "data is the new oil," with an air of seriousness. And while the phrase may capture a certain truth about the modern digital economy, it fails to model the way that bits can be copied again and again. Sometimes the ease of sharing creates a distinct absence of scarcity and that changes the economics of the entire game. One of the best ways to visualize this is to tap into some open source datasets that are proliferating on the Internet.
Council Post: What Exactly Is Artificial Intelligence? (Hint: It's All About The Datasets)
Boris Kontsevoi is a technology executive, President and CEO of Intetics Inc., a global software engineering and data processing company. Many of today's emerging technologies and products rely heavily on artificial intelligence (AI) and machine learning (ML). And while hundreds of articles have been written about this topic, very few get into the nitty-gritty of what truly powers AI: data. The definition of artificial intelligence varies depending on whom you ask. A data scientist will have a much different answer than someone who is just peripherally aware of AI.
Open Source Datasets for Computer Vision - KDnuggets
Computer Vision (CV) is one of the most exciting subfields within the Artificial Intelligence (AI) and Machine Learning (ML) domain. It is a major component for many modern AI/ML pipelines, and it's transforming almost every industry, enabling organizations to revolutionize the way machines and business systems work. Academically, CV has been a well-established area of computer science for many decades, and over the years, a lot of research has gone into this field to make it better. However, the use of deep neural networks has recently revolutionized the field and given it new fuel for accelerated growth. In this article, we discuss some of the most popular and effective datasets used in the domain of Deep Learning (DL) to train state-of-the-art ML systems for CV tasks.
An A.I. Training Tool Has Been Passing Its Bias to Algorithms for Almost Two Decades
Night after night, Fien de Meulder sat in front of her Linux computer flagging names of people, places, and organizations in sentences pulled from Reuters newswire articles. De Meulder and her colleague, Erik Tjong Kim Sang, worked in language technology at the University of Antwerp. It was 2003, and a 60-hour workweek was typical in academic circles. She chugged Coke to stay awake. The goal: develop an open source dataset to help machine learning (ML) models learn to identify and categorize entities in text.
10 Best Entry Level Machine Learning Tutorials
The field of machine learning is becoming easier and easier to enter thanks to readily available tools, a wide range of open source datasets, and a community open to sharing ideas and giving advice. Almost everything you need to get started is online; it's just a matter of finding it. To help entry-level enthusiasts get their heads around different ML systems and how to implement them, I've put together some of my favorite machine learning tutorials. All of the following articles provide a brief introduction to the systems being covered, talk you through the cleaning, testing, and implementation process, and also provide links to datasets and GitHub repositories so you can follow the same steps on your own. This detailed guide explores transformer architecture by creating a translator that takes an English sentence and translates it to German. It covers data preprocessing and model training, and wraps things up by looking at the results and what could be done to improve the system.
Open Source Dataset and Machine Learning Techniques for Automatic Recognition of Historical Graffiti
Gordienko, Nikita, Gang, Peng, Gordienko, Yuri, Zeng, Wei, Alienin, Oleg, Rokovyi, Oleksandr, Stirenko, Sergii
Machine learning techniques are presented for automatic recognition of the historical letters (XI-XVIII centuries) carved on the stone walls of St. Sophia Cathedral in Kyiv (Ukraine). A new image dataset of these carved Glagolitic and Cyrillic letters (CGCL) was assembled and pre-processed for recognition and prediction by machine learning methods. The dataset consists of more than 4000 images for 34 types of letters. Exploratory data analysis of the CGCL and notMNIST datasets showed that the carved letters can hardly be differentiated by dimensionality reduction methods, for example, by t-distributed stochastic neighbor embedding (tSNE), because letters are represented less clearly by stone carving than by handwriting. Multinomial logistic regression (MLR) and 2D convolutional neural network (CNN) models were applied. The MLR model achieved area under the curve (AUC) values for the receiver operating characteristic (ROC) of at least 0.92 and 0.60 for notMNIST and CGCL, respectively. The CNN model gave AUC values close to 0.99 for both notMNIST and CGCL (despite the much smaller size and lower quality of CGCL in comparison to notMNIST) under the condition of highly lossy data augmentation. The CGCL dataset was published as an open source resource for the data science community.
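The abstract above reports ROC AUC values for multi-class letter recognition but does not say how they were computed. As a rough illustration only, the sketch below assumes a one-vs-rest scheme, in which each letter class is scored against all others. The function names (`roc_auc`, `one_vs_rest_auc`) and the toy scores are hypothetical and are not taken from the paper or from the CGCL dataset.

```python
from typing import List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC for binary labels (1 = positive, 0 = negative).

    Uses the rank identity: AUC equals the probability that a randomly
    chosen positive is scored above a randomly chosen negative, with
    ties counted as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(true_classes: Sequence[int],
                    class_scores: Sequence[Sequence[float]],
                    num_classes: int) -> List[float]:
    """Per-class AUC: each class in turn is the positive, all others negative."""
    return [
        roc_auc([1 if c == k else 0 for c in true_classes],
                [row[k] for row in class_scores])
        for k in range(num_classes)
    ]


# Toy example (made-up scores for 3 classes, NOT data from the paper).
toy_classes = [0, 1, 2, 0, 1, 2]
toy_scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.2, 0.3, 0.5],
    [0.4, 0.5, 0.1],
    [0.3, 0.4, 0.3],
    [0.1, 0.1, 0.8],
]
toy_aucs = one_vs_rest_auc(toy_classes, toy_scores, num_classes=3)
print(toy_aucs)  # [1.0, 0.875, 1.0]
```

Under this assumed scheme, a statement such as "AUC values not lower than 0.92" would correspond to the minimum of the per-class list. The pairwise loop is quadratic in the number of examples; it is meant for clarity, and a sort-based computation would be preferable for a dataset of several thousand images.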